Not All Instances Are Equally Valuable: Towards Influence-Weighted Dataset Distillation
Deng, Qiyan, Zheng, Changqian, Qiao, Lianpeng, Wang, Yuping, Chai, Chengliang, Cao, Lei
Dataset distillation condenses large datasets into synthetic subsets, achieving performance comparable to training on the full dataset while substantially reducing storage and computation costs. Most existing dataset distillation methods assume that all real instances contribute equally to the process. In practice, real-world datasets contain both informative and redundant or even harmful instances, and directly distilling the full dataset without considering data quality can degrade model performance. In this work, we present Influence-Weighted Distillation (IWD), a principled framework that leverages influence functions to explicitly account for data quality in the distillation process. IWD assigns adaptive weights to each instance based on its estimated impact on the distillation objective, prioritizing beneficial data while downweighting less useful or harmful ones. Owing to its modular design, IWD can be seamlessly integrated into diverse dataset distillation frameworks. Our empirical results suggest that integrating IWD tends to improve the quality of distilled datasets and enhance model performance, with accuracy gains of up to 7.8%.
- North America > United States > Massachusetts > Middlesex County > Framingham (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > Arizona (0.04)
- Asia > China > Beijing > Beijing (0.04)
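The abstract does not give IWD's weighting formula, but the core idea, scoring each real instance and weighting its contribution to the distillation objective, can be sketched. This is a minimal, hypothetical PyTorch sketch: the softmax mapping, the temperature parameter, and the function names are assumptions, not the paper's method.

```python
# Hypothetical sketch of influence-weighted distillation. The abstract
# does not specify IWD's weighting scheme; the softmax mapping and the
# temperature below are assumptions.
import torch

def influence_weights(influence_scores: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    # A softmax keeps weights positive and normalized, so instances with
    # negative estimated influence are downweighted rather than hard-dropped.
    return torch.softmax(influence_scores / temperature, dim=0)

def weighted_distillation_loss(per_instance_loss: torch.Tensor,
                               influence_scores: torch.Tensor) -> torch.Tensor:
    # Weight each real instance's contribution to the distillation
    # objective by its estimated influence.
    w = influence_weights(influence_scores)
    return (w * per_instance_loss).sum()
```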
Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems
Lee, Alexander W., Chan, Justin, Fu, Michael, Kim, Nicolas, Mehta, Akshay, Raghavan, Deepti, Cetintemel, Ugur
The emergence of AI-augmented Data Processing Systems (DPSs) has introduced powerful semantic operators that extend traditional data management capabilities with LLM-based processing. However, these systems face fundamental reliability (a.k.a. trust) challenges, as LLMs can generate erroneous outputs, limiting their adoption in critical domains. Existing approaches to LLM constraints--ranging from user-defined functions to constrained decoding--are fragmented, imperative, and lack semantics-aware integration into query execution. To address this gap, we introduce Semantic Integrity Constraints (SICs), a novel declarative abstraction that extends traditional database integrity constraints to govern and optimize semantic operators within DPSs. SICs integrate seamlessly into the relational model, allowing users to specify common classes of constraints (e.g., grounding and soundness) while enabling query-aware enforcement and optimization strategies. In this paper, we present the core design of SICs, describe their formal integration into query execution, and detail our conception of grounding constraints, a key SIC class that ensures factual consistency of generated outputs. In addition, we explore novel enforcement mechanisms, combining proactive (constrained decoding) and reactive (validation and recovery) techniques to optimize efficiency and reliability. Our work establishes SICs as a foundational framework for trustworthy, high-performance AI-augmented data processing, paving the way for future research in constraint-driven optimizations, adaptive enforcement, and enterprise-scale deployments.
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Rhode Island (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- (10 more...)
- Health & Medicine (1.00)
- Information Technology > Software (0.81)
- Information Technology > Databases (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.54)
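SICs are declarative and integrated into query execution; the paper's actual syntax is not reproduced here. The sketch below only illustrates the reactive enforcement pattern the abstract mentions (validate a generated output, then recover by regenerating), with a deliberately naive grounding check. The dataclass form, the token-overlap check, and the retry loop are illustrative assumptions.

```python
# Hypothetical sketch of a grounding-style semantic integrity constraint
# enforced reactively (validate, then recover by regenerating). The
# representation and the naive grounding check are assumptions only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SemanticIntegrityConstraint:
    name: str
    check: Callable[[str, str], bool]  # (source_text, llm_output) -> ok?

def grounded_in_source(source: str, output: str) -> bool:
    # Naive grounding check: every sentence of the output must share
    # vocabulary with the source document. A real system would use
    # entailment or citation verification instead.
    src_tokens = set(source.lower().split())
    return all(set(s.lower().split()) & src_tokens
               for s in output.split(".") if s.strip())

def enforce(constraint: SemanticIntegrityConstraint, source: str,
            generate: Callable[[str], str], max_retries: int = 3) -> str:
    # Reactive enforcement: regenerate until the constraint holds.
    for _ in range(max_retries):
        out = generate(source)
        if constraint.check(source, out):
            return out
    raise ValueError(f"constraint '{constraint.name}' unsatisfied")
```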
A New Flexible Train-Test Split Algorithm, an approach for choosing among the Hold-out, K-fold cross-validation, and Hold-out iteration
Bami, Zahra, Behnampour, Ali, Doosti, Hassan
Artificial intelligence has transformed industries such as engineering, medicine, and finance. Predictive models rely on supervised learning, a vital subset of machine learning. Cross-validation, which is crucial for model evaluation, includes re-substitution, hold-out, and K-fold approaches. This study focuses on improving the accuracy assessment of ML algorithms across three different datasets. To evaluate the hold-out, hold-out with iteration, and K-fold cross-validation techniques, we created a flexible Python program that varies parameters such as test size, random state, and 'k' values. The outcomes demonstrate the consistent superiority of the hold-out validation method, particularly with a test size of 10%. Across iterations and random state settings, hold-out with iteration shows little variance in accuracy. Results also vary by algorithm, with Decision Tree performing best on the Framingham data and Naive Bayes and K Nearest Neighbors on the COVID-19 data. Different datasets require different optimal K values in K-fold cross-validation, which highlights the importance of these considerations. This study challenges the universality of K values in K-fold cross-validation and suggests a 10% test size and 90% training size for better outcomes. It also emphasizes the contextual impact of dataset features, sample size, feature count, and selected methodologies. Researchers can adapt these programs to their own datasets to obtain the highest accuracy for their specific evaluation.
- Europe > Italy > Piedmont > Turin Province > Turin (0.04)
- Oceania > Australia (0.04)
- North America > United States > Massachusetts > Middlesex County > Framingham (0.04)
- (3 more...)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Cross Validation (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
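The three evaluation schemes the paper compares map directly onto scikit-learn primitives. A minimal sketch, assuming a stand-in dataset and classifier (the authors' program and datasets are not reproduced):

```python
# Minimal sketch of hold-out, hold-out with iteration, and K-fold
# cross-validation; dataset, model, and parameter values are placeholders.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 1) Hold-out with the 10% test size the paper recommends.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, random_state=0)
holdout_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

# 2) Hold-out with iteration: repeat over random states and average.
iter_accs = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10,
                                              random_state=seed)
    iter_accs.append(model.fit(X_tr, y_tr).score(X_te, y_te))

# 3) K-fold cross-validation for several candidate k values.
kfold_accs = {
    k: cross_val_score(model, X, y,
                       cv=KFold(n_splits=k, shuffle=True, random_state=0)).mean()
    for k in (5, 10)
}

print(holdout_acc, np.mean(iter_accs), kfold_accs)
```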
Petal-X: Human-Centered Visual Explanations to Improve Cardiovascular Risk Communication
Rojo, Diego, Lamqaddam, Houda, Gosak, Lucija, Verbert, Katrien
Cardiovascular diseases (CVDs), the leading cause of death worldwide, can be prevented in most cases through behavioral interventions. Therefore, effective communication of CVD risk and projected risk reduction by risk factor modification plays a crucial role in reducing CVD risk at the individual level. However, despite interest in refining risk estimation with improved prediction models such as SCORE2, the guidelines for presenting these risk estimations in clinical practice have remained essentially unchanged in recent years, with graphical score charts (GSCs) continuing to be one of the prevalent systems. This work describes the design and implementation of Petal-X, a novel tool to support clinician-patient shared decision-making by explaining the CVD risk contributions of different factors and facilitating what-if analysis. Petal-X relies on a novel visualization, Petal Product Plots, and a tailor-made global surrogate model of SCORE2, whose fidelity is comparable to that of the GSCs used in clinical practice. We evaluated Petal-X compared to GSCs in a controlled experiment with 88 healthcare students, all but one with experience with chronic patients. The results show that Petal-X outperforms GSCs in critical tasks, such as comparing the contribution to the patient's 10-year CVD risk of each modifiable risk factor, without a significant loss of perceived transparency, trust, or intent to use. Our study provides an innovative approach to the visualization and explanation of risk in clinical practice that, due to its model-agnostic nature, could continue to support next-generation artificial intelligence risk assessment models.
- Europe > Slovenia > Drava > Municipality of Maribor > Maribor (0.04)
- Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
- North America > United States > Massachusetts > Middlesex County > Framingham (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
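The paper's surrogate of SCORE2 is tailor-made, but the general recipe for a global surrogate is to sample inputs, query the black-box risk model, and fit an interpretable model to its outputs. A hypothetical sketch, in which the risk function, the features, and the logistic-regression surrogate are all placeholders, not the paper's model:

```python
# Hypothetical sketch of fitting a global surrogate to a CVD risk score,
# in the spirit of Petal-X's surrogate of SCORE2. The black-box risk
# function below is NOT the real SCORE2 formula.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def black_box_risk(X):
    # Placeholder for an opaque risk model (stand-in coefficients only).
    logits = 0.04 * X[:, 0] + 0.02 * X[:, 1] + 0.5 * X[:, 2] - 5.0
    return (1 / (1 + np.exp(-logits)) > 0.5).astype(int)  # high-risk flag

# Sample patient profiles, query the black box, then fit an interpretable
# surrogate whose coefficients can drive a per-factor contribution view.
X = np.column_stack([rng.uniform(40, 70, 5000),     # age
                     rng.uniform(100, 180, 5000),   # systolic BP
                     rng.integers(0, 2, 5000)])     # smoker
surrogate = LogisticRegression().fit(X, black_box_risk(X))
print("fidelity:", surrogate.score(X, black_box_risk(X)))
print("per-factor weights:", surrogate.coef_)
```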
Comparison of Machine Learning Classification Algorithms and Application to the Framingham Heart Study
The use of machine learning algorithms in healthcare can amplify social injustices and health inequities. While biases can arise and compound during problem selection, data collection, and outcome definition, this research addresses generalizability impediments that occur during the development and post-deployment of machine learning classification algorithms. Using the Framingham coronary heart disease data as a case study, we show how to effectively select a probability cutoff to convert a regression model for a dichotomous variable into a classifier. We then compare the sampling distribution of the predictive performance of eight machine learning classification algorithms under four training/testing scenarios to test their generalizability and their potential to perpetuate biases. We show that both Extreme Gradient Boosting and Support Vector Machine are flawed when trained on an unbalanced dataset. We introduce the double discriminant scoring of type I and show that it is the most generalizable, as it consistently outperforms the other classification algorithms regardless of the training/testing scenario. Finally, we introduce a methodology to extract an optimal variable hierarchy for a classification algorithm, and illustrate it on the overall, male, and female Framingham coronary heart disease data.
- North America > United States > New York (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- North America > United States > Virginia > Alexandria County > Alexandria (0.04)
- (9 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
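Converting a probabilistic model into a classifier by selecting a cutoff can be sketched with standard tools. Below, Youden's J statistic on the ROC curve picks the cutoff; this is one common criterion, not necessarily the paper's selection rule, and the synthetic unbalanced dataset is a stand-in for the Framingham data.

```python
# A minimal sketch of choosing a probability cutoff to turn a
# probabilistic model into a classifier. Youden's J is one common
# criterion; the paper's exact rule and its "double discriminant
# scoring" are not reproduced here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Unbalanced synthetic data, mimicking rare CHD outcomes.
X, y = make_classification(n_samples=4000, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)
best_cutoff = thresholds[np.argmax(tpr - fpr)]  # maximize Youden's J = TPR - FPR
y_pred = (probs >= best_cutoff).astype(int)
```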
CDRH Seeks Public Comment: Digital Health Technologies for Detecting Prediabetes and Undiagnosed Type 2 Diabetes
This document provides responses to the FDA's request for public comments (Docket No. FDA-2023-N-4853) on the role of digital health technologies (DHTs) in detecting prediabetes and undiagnosed type 2 diabetes. It explores current DHT applications in the prevention, detection, treatment, and reversal of prediabetes, highlighting AI chatbots, online forums, wearables, and mobile apps. The methods employed by DHTs to capture health signals such as glucose, diet, symptoms, and community insights are outlined. Key subpopulations that could benefit most from remote screening tools include rural residents, minority groups, high-risk individuals, and those with limited healthcare access. Capturable high-impact risk factors encompass glycemic variability, cardiovascular parameters, respiratory health, blood biomarkers, and patient-reported symptoms. An array of non-invasive monitoring tools is discussed, although further research into their accuracy for diverse groups is warranted. Extensive health datasets providing immense opportunities for AI- and ML-based risk modeling are presented. Promising techniques leveraging EHRs, imaging, wearables, and surveys to enhance screening through AI and ML algorithms are showcased. Analysis of social media and streaming data further enables disease prediction across populations. Ongoing innovation focused on inclusivity and accessibility is highlighted as pivotal to unlocking DHTs' potential for transforming prediabetes and diabetes prevention and care.
- Europe > United Kingdom (0.14)
- Europe > Netherlands > South Holland > Rotterdam (0.05)
- Europe > Finland (0.04)
- (4 more...)
- Research Report > Promising Solution (0.48)
- Research Report > Experimental Study (0.46)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)
- Government > Regional Government > North America Government > United States Government > FDA (0.57)
SeFNet: Bridging Tabular Datasets with Semantic Feature Nets
Woźnica, Katarzyna, Wilczyński, Piotr, Biecek, Przemysław
Machine learning applications cover a wide range of predictive tasks in which tabular datasets play a significant role. However, although they often address similar problems, tabular datasets are typically treated as standalone tasks. The possibilities of using previously solved problems are limited due to the lack of structured contextual information about their features and the lack of understanding of the relations between them. To overcome this limitation, we propose a new approach called Semantic Feature Net (SeFNet), capturing the semantic meaning of the analyzed tabular features. By leveraging existing ontologies and domain knowledge, SeFNet opens up new opportunities for sharing insights between diverse predictive tasks. One such opportunity is the Dataset Ontology-based Semantic Similarity (DOSS) measure, which quantifies the similarity between datasets using relations across their features. In this paper, we present an example of SeFNet prepared for a collection of predictive tasks in healthcare, with the features' relations derived from the SNOMED-CT ontology. The proposed SeFNet framework and the accompanying DOSS measure address the issue of limited contextual information in tabular datasets. By incorporating domain knowledge and establishing semantic relations between features, we enhance the potential for meta-learning and enable valuable insights to be shared across different predictive tasks.
- Europe > Poland > Masovia Province > Warsaw (0.04)
- South America > Brazil > São Paulo (0.04)
- Oceania > Australia (0.04)
- (4 more...)
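DOSS quantifies dataset similarity via relations across features, though its exact definition is not given in the abstract. A hypothetical sketch of the general pattern, aggregating pairwise feature similarities with a best-match average; the Jaccard-over-ancestors similarity and the toy ontology are stand-ins for the SNOMED-CT-derived relations SeFNet uses:

```python
# Hypothetical sketch of a DOSS-style dataset similarity: aggregate
# pairwise semantic similarities between two datasets' features. The
# ancestor-set similarity and best-match-average aggregation are
# assumptions, not the paper's definition.
def feature_similarity(f1: str, f2: str, ancestors: dict) -> float:
    # Jaccard similarity of ontology ancestor sets (a toy stand-in for
    # a real ontology-based semantic similarity).
    a, b = ancestors[f1], ancestors[f2]
    return len(a & b) / len(a | b) if a | b else 0.0

def dataset_similarity(feats_a, feats_b, ancestors) -> float:
    # Best-match average: match each feature in one dataset to its most
    # similar feature in the other, average, then symmetrize.
    def one_way(src, dst):
        return sum(max(feature_similarity(f, g, ancestors) for g in dst)
                   for f in src) / len(src)
    return 0.5 * (one_way(feats_a, feats_b) + one_way(feats_b, feats_a))

ancestors = {
    "systolic_bp": {"vital_sign", "cardiovascular"},
    "diastolic_bp": {"vital_sign", "cardiovascular"},
    "glucose": {"lab_result", "metabolic"},
}
print(dataset_similarity(["systolic_bp"], ["diastolic_bp", "glucose"], ancestors))
```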
Tutorial: Deep Learning + OA & DCS for Heart Disease Prediction
How likely is a person to develop a heart disease condition within the next ten years? The following BOTX tutorial shows how to create a machine learning model using neural networks, DCS (our integrated data platform), and an online assistant to predict exactly this, based on 15 parameters. This blog post and video are part one of the tutorial series, focusing on the machine learning model and the DCS setup. Around 17.5 million people die each year from cardiovascular diseases (CVDs), an estimated 31% of all deaths worldwide, and that figure is expected to grow to more than 23.6 million by 2030.
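As a rough stand-in for the model the tutorial builds (the DCS platform calls, the actual clinical features, and the network architecture are not shown here, so everything below is an assumption), a small neural network on 15 parameters might look like this:

```python
# Generic sketch of a small neural network predicting 10-year heart
# disease risk from 15 input parameters. Data, features, and
# architecture are placeholders, not the tutorial's setup.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 15))  # 15 clinical parameters (placeholder data)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 1).astype(int)

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
model.fit(X, y)
print("10-year risk:", model.predict_proba(X[:1])[0, 1])
```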
Can AI Predict If Your House Is Going To Burn To The Ground?
Standing on the outskirts of Oakland, California, Attila Toth takes in the nearby forested hills. The CEO looks out on what locals call "The Town" and, in the distance, San Francisco, or "The City." Close by, Toth sees tangles of redwood, eucalyptus and oak trees – and the wildfire risk they pose. This "wildland-urban interface" isn't far from the site of the 1991 Oakland Hills Fire, which flared up suddenly in a heavily residential area. Over four days, nearly 3,000 homes were destroyed in one of the city's wealthiest neighborhoods, causing an estimated $1.5 billion in damages ($3.2 billion in today's dollars).
- North America > United States > California > San Francisco County > San Francisco (0.25)
- North America > United States > California > Alameda County > Oakland (0.25)
- North America > United States > Utah (0.05)
- (13 more...)
- Banking & Finance > Insurance (0.73)
- Energy > Renewable (0.70)
Is Machine Learning The Future Of Coffee Health Research? - AI Summary
The stories generally go like this: "a study finds drinking coffee is associated with a X% decrease in [bad health outcome]," followed shortly by "the study is observational and does not prove causation." In a new study in the American Heart Association's journal Circulation: Heart Failure, researchers found a link between drinking three or more cups of coffee a day and a decreased risk of heart failure. Led by David Kao, a cardiologist at the University of Colorado School of Medicine, researchers re-examined the Framingham Heart Study (FHS), "a long-term, ongoing cardiovascular cohort study of residents of the city of Framingham, Massachusetts" that began in 1948 and has grown to include over 14,000 participants. Able to analyze massive amounts of data in a short amount of time--as well as be programmed to handle uncertainties in the data, such as whether a reported cup of coffee is six ounces or eight ounces--machine learning can then start to ascertain and rank which variables are most associated with incidence of heart failure, giving even observational studies more explanatory power in their findings. And indeed, when the results of the FHS machine learning analysis were compared to two other well-known studies, the Cardiovascular Health Study (CHS) and the Atherosclerosis Risk in Communities study (ARIC), the algorithm was able "to correctly predict the relationship between coffee intake and heart failure."
- North America > United States > Massachusetts > Middlesex County > Framingham (0.28)
- North America > United States > Colorado (0.28)
- Research Report > Strength Medium (0.62)
- Research Report > Observational Study (0.62)
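The kind of analysis described, ranking which variables are most associated with an outcome, is commonly done with tree-ensemble feature importances. A small illustrative sketch on synthetic data; the FHS variables and the study's actual method are not reproduced here:

```python
# Illustrative sketch of variable ranking via a tree ensemble's feature
# importances; synthetic data and variable names are assumptions, not
# the FHS analysis.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
columns = ["coffee_cups_per_day", "age", "systolic_bp", "smoker", "bmi"]
X = rng.normal(size=(2000, len(columns)))
y = (0.8 * X[:, 1] + 0.5 * X[:, 2] - 0.3 * X[:, 0]
     + rng.normal(size=2000) > 0.5).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = sorted(zip(columns, rf.feature_importances_), key=lambda t: -t[1])
for name, imp in ranking:
    print(f"{name}: {imp:.3f}")
```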